Earthquakes are one of the most destructive natural disasters that
can strike without warning, causing extensive damage to infrastructure,
loss of life, and massive economic losses. While we cannot prevent
earthquakes from occurring, the ability to accurately predict when and
where they might occur could save countless lives and minimize the
damage caused.
Therefore, our aim with this report is to contribute the significant
earthquake prediction which enables to provide advanced warning of
potentially catastrophic seismic events, allowing governments and
communities to prepare and take necessary measures to minimize the
impact of such events.
The Significant Earthquake Database contains information on
destructive earthquakes from 2150 B.C. to the present that meet at least
one of the following criteria: Moderate damage (approximately $1 million
or more), 10 or more deaths, Magnitude 7.5 or greater, Modified Mercalli
Intensity X or greater, or the earthquake generated a tsunami. The
database can also be displayed and extracted with the Natural Hazards
Interactive Map.
Below we are listing a short summary of our main variables. At the
primary and secondary deaths and damages where available the total
numbers have been added to the dataset, in the “description” field the
variables have already been clustered.
Valid values: 0.0 to 9.9
The value in this column contains the primary earthquake magnitude.
Magnitude measures the energy released at the source of the earthquake.
Magnitude is determined from measurements on seismographs. For
pre-instrumental events, the magnitudes are derived from intensities.
There are several different scales for measuring earthquake magnitudes.
The primary magnitude is chosen from the available magnitude scales in
this order:
Mw Magnitude
Ms Magnitude
Mb Magnitude
Ml Magnitude
Mfa Magnitude
Unknown Magnitude
Valid values: 1 to 12
The Modified Mercalli Intensity (Int) is given in Roman Numerals
(converted to numbers in the digital database). An interpretation of the
values is listed below.
Table 1. Modified Mercalli Intensity Scale of 1931
Not felt except by a very few under especially favorable
circumstances.
Felt only by a few persons at rest, especially on upper floors of
buildings. Delicately suspended objects may swing.
Felt quite noticeably indoors, especially on upper floors of
buildings, but many people do not recognize it as an earthquake.
Standing motor cars may rock slightly. Vibration like passing truck.
Duration estimated.
During the day felt indoors by many, outdoors by few. At night
some awakened. Dishes, windows, and doors disturbed; walls make creaking
sound. Sensation like heavy truck striking building. Standing motorcars
rock noticeably.
Felt by nearly everyone; many awakened. Some dishes, windows,
etc., broken; a few instances of cracked plaster; unstable objects
overturned. Disturbance of trees, poles, and other tall objects
sometimes noticed. Pendulum clocks may stop.
Felt by all; many frightened and run outdoors. Some heavy
furniture moved; a few instances of fallen plaster or damaged chimneys.
Damage slight.
Everybody runs outdoors. Damage negligible in buildings of good
design and construction slight to moderate in well built ordinary
structures; considerable in poorly built or badly designed structures.
Some chimneys broken. Noticed by persons driving motor cars.
Damage slight in specially designed structures; considerable in
ordinary substantial buildings, with partial collapse; great in poorly
built structures. Panel walls thrown out of frame structures. Fall of
chimneys, factory stacks, columns, monuments, walls. Heavy furniture
overturned. Sand and mud ejected in small amounts. Changes in well
water. Persons driving motor cars disturbed.
Damage considerable in specially designed structures;
well-designed frame structures thrown out of plumb; great in substantial
buildings, with partial collapse. Buildings shifted off foundations.
Ground cracked conspicuously. Underground pipes broken.
Some well-built wooden structures destroyed; most masonry and
frame structures destroyed with foundations; ground badly cracked. Rails
bent. Landslides considerable from river banks and steep slopes. Shifted
sand and mud. Water splashed over banks.
Few, if any (masonry), structures remain standing. Bridges
destroyed. Broad fissures in ground. Underground pipelines completely
out of service. Earth slumps and land slips in soft ground. Rails bent
greatly.
Damage total. Waves seen on ground surfaces. Lines of sight and
level distorted. Objects thrown upward into the air.
The depth of the earthquake is given in kilometers.
Associated Tsunami or Seiche [Tsu]
When a tsunami or seiche was generated by an earthquake, An icon appears
in the Associated Tsunami column which is linked to the tsunami event
database. The link will display additional tsunami event
information.
Volcano [Vol]
The Volcano link will display additional information if the earthquake
was associated with a volcanic eruption. The information may include
information such as the VEI index, morphology, and the effects of the
eruption.
Description of Deaths from the Earthquake
[Death.Description]
Valid values: 0 to 4
When a description was found in the historical literature instead of
an actual number of deaths, this value was coded and listed in the
Deaths column. If the actual number of deaths was listed, a descriptor
was also added for search purposes.
0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)
Description of Damage from the Earthquake
[Damage.Description]
Valid values: 0 to 4
For those events not offering a monetary evaluation of damage, the
following five-level scale was used to classify damage (1990 dollars)
and was listed in the Damage column. If the actual dollar amount of
damage was listed, a descriptor was also added for search purposes.
0 NONE
1 LIMITED (roughly corresponding to less than $1 million)
2 MODERATE (~$1 to $5 million)
3 SEVERE (~$5 to $25 million)
4 EXTREME (~$25 million or more)
Description of Deaths from the Earthquake and secondary
effects (eg Tsunami)
Valid values: 0 to 4
When a description was found in the historical literature instead of
an actual number of deaths, this value was coded and listed in the
Deaths column. If the actual number of deaths was listed, a descriptor
was also added for search purposes.
0 None
1 Few (~1 to 50 deaths)
2 Some (~51 to 100 deaths)
3 Many (~101 to 1000 deaths)
4 Very many (over 1000 deaths)
At first we will have a look at the data provided:
str(eqdata)
## 'data.frame': 6365 obs. of 39 variables:
## $ Search.Parameters : chr "[]" "" "" "" ...
## $ Year : int NA -2150 -2000 -2000 -1610 -1566 -1450 -1365 -1250 -1050 ...
## $ Mo : int NA NA NA NA NA NA NA NA NA NA ...
## $ Dy : int NA NA NA NA NA NA NA NA NA NA ...
## $ Hr : int NA NA NA NA NA NA NA NA NA NA ...
## $ Mn : int NA NA NA NA NA NA NA NA NA NA ...
## $ Sec : num NA 0 NA NA NA 0 NA NA 0 0 ...
## $ Tsu : int NA NA 1 NA 3 NA NA 4 NA NA ...
## $ Vol : int NA NA NA NA 1351 NA NA NA NA NA ...
## $ Location.Name : chr "" "JORDAN: BAB-A-DARAA,AL-KARAK" "SYRIA: UGARIT" "TURKMENISTAN: W" ...
## $ Latitude : num NA 31.1 35.7 38 36.4 ...
## $ Longitude : num NA 35.5 35.8 58.2 25.4 35.3 25.5 35.8 35.5 35 ...
## $ Focal.Depth..km. : int NA NA NA 18 NA NA NA NA NA NA ...
## $ Mag : num NA 7.3 NA 7.1 NA NA NA NA 6.5 6.2 ...
## $ MMI.Int : int NA NA 10 10 NA 10 10 NA NA NA ...
## $ Deaths : int NA NA NA 1 NA NA NA NA NA NA ...
## $ Death.Description : int NA NA 3 1 NA NA NA NA NA NA ...
## $ Missing : int NA NA NA NA NA NA NA NA NA NA ...
## $ Missing.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Injuries.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Damage...Mil. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Damage.Description : int NA 3 NA 1 NA 3 NA 3 3 3 ...
## $ Houses.Destroyed : int NA NA NA NA NA NA NA NA NA NA ...
## $ Houses.Destroyed.Description : int NA NA NA 1 NA NA NA NA NA NA ...
## $ Houses.Damaged : int NA NA NA NA NA NA NA NA NA NA ...
## $ Houses.Damaged.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Deaths : int NA NA NA 1 NA NA NA NA NA NA ...
## $ Total.Death.Description : int NA NA 3 1 3 NA NA NA NA NA ...
## $ Total.Missing : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Missing.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Injuries.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Damage...Mil. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Damage.Description : int NA NA NA 1 3 NA NA 3 NA NA ...
## $ Total.Houses.Destroyed : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Houses.Destroyed.Description: int NA NA NA 1 NA NA NA NA NA NA ...
## $ Total.Houses.Damaged : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Houses.Damaged.Description : int NA NA NA NA NA NA NA NA NA NA ...
The original data set has 6365 observations of 39 variables.
1.3 Cleaning the data
After loading we have decided to perform the below cleaning our
dataset:
Exclude data dated before 1800
Our reasons are the following:
data available mostly from historical records therefore less
reliable
the measurement quality is not reliable based on less developed
methods
we had a lot of missing values from these records
Exclude data where magnitude is not available
the variable of magnitude has key importance in our analysis, the
records where it is missing we cannot trust.
## 'data.frame': 4028 obs. of 39 variables:
## $ Search.Parameters : chr "" "" "" "" ...
## $ Year : int 1800 1801 1802 1802 1803 1803 1804 1805 1806 1806 ...
## $ Mo : int NA 10 10 12 2 9 7 6 3 6 ...
## $ Dy : int NA 5 26 9 1 1 10 16 25 11 ...
## $ Hr : int NA NA 10 5 NA NA 13 8 NA NA ...
## $ Mn : int NA NA 55 NA NA NA NA 15 NA NA ...
## $ Sec : num NA NA 0 NA NA 0 NA 0 NA NA ...
## $ Tsu : int NA NA 5843 588 NA NA 593 NA NA NA ...
## $ Vol : int NA NA NA NA NA NA NA NA NA NA ...
## $ Location.Name : chr "IRAN: DAMAVAND" "MEXICO: OAXACA" "ROMANIA: CARPATHIAN FOLD,VRANCEA" "JAPAN: NW HONSHU: SADO ISLAND, OGI" ...
## $ Latitude : num 36.2 16.2 45.7 37.7 25.6 ...
## $ Longitude : num 53.3 -96.1 26.6 138.3 100.6 ...
## $ Focal.Depth..km. : int NA NA 150 NA NA NA NA 15 NA NA ...
## $ Mag : num 6.5 7 7.4 6.6 6 7.5 7.3 6.1 7.5 7.7 ...
## $ MMI.Int : int NA NA 10 NA 8 NA NA 8 8 NA ...
## $ Deaths : int NA 7 NA 19 200 NA 450 200 2000 100 ...
## $ Death.Description : int NA 1 NA 1 3 3 3 3 4 2 ...
## $ Missing : int NA NA NA NA NA NA NA NA NA NA ...
## $ Missing.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Injuries.Description : int NA 2 NA NA NA NA NA 3 NA NA ...
## $ Damage...Mil. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Damage.Description : int 3 3 2 3 2 3 4 3 4 2 ...
## $ Houses.Destroyed : int NA NA NA 1060 NA NA 10810 NA NA NA ...
## $ Houses.Destroyed.Description : int NA NA NA 4 2 3 4 3 3 3 ...
## $ Houses.Damaged : int NA NA NA NA NA NA NA NA NA NA ...
## $ Houses.Damaged.Description : int NA NA 2 NA NA 3 NA 3 NA NA ...
## $ Total.Deaths : int NA 7 NA 19 NA NA 450 200 2000 100 ...
## $ Total.Death.Description : int NA 1 NA 1 NA 3 3 3 4 2 ...
## $ Total.Missing : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Missing.Description : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Injuries : int NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Injuries.Description : int NA 2 NA NA NA NA NA 3 NA NA ...
## $ Total.Damage...Mil. : num NA NA NA NA NA NA NA NA NA NA ...
## $ Total.Damage.Description : int NA 3 2 3 2 3 4 3 4 2 ...
## $ Total.Houses.Destroyed : int NA NA NA 1060 NA NA 10810 NA NA NA ...
## $ Total.Houses.Destroyed.Description: int NA NA NA 4 2 3 4 3 3 3 ...
## $ Total.Houses.Damaged : int NA NA NA NA NA NA 300 NA NA NA ...
## $ Total.Houses.Damaged.Description : int NA NA 2 NA NA 3 3 3 NA NA ...
The original data set has 6365 observations of 39 variables.
After cleaning the dataset we have 4028 observations of 39
variables remaining. We will perform all the following analysis
with this dataset.
Linear regression models are unsupervised models, which means we want
to predict how the dependent variable changes with changing the
independent variables. Regression models the dependent variable takes
quantitative measures.(this means Y value is an increasing or decreasing
number whith changing the value of the dependent variables.)
Generally linear models are never completely correct, but the
interpretability of the linear model is relatively high compared to
other more complex models. The danger of overfitting is generally less
with linear models.
Linear regression can be performed with fitting a single predictor or
a multiple predictors variables at the same time. Generally one starts
with fitting all the relevant predictors.This is the reason, we start
our statistical analysis with a multiple linear regression below.
3.2 Our research questions
The scope of this analysis is to be able to make predictions about
number of deaths caused by earthquakes. This is to understand how
magnitude, intensity, focal depth,time and location correspond to the
total number of deaths by earthquakes.
We are going to look at which independent variables have an effect on
the:
Y1: total number of deaths (with secondary effects):
[Total.Damage…Mil.]
Y2: total cost of damage (USD) (with secondary effects):
[Total.Deaths]
Remark: We are going to use the backward evaluation model, which
means we are fitting all relevant the independent variables available in
our data set, and removing them one by one looking at how the
significance changes.
In both cases we will examine the below independent variables:
magnitude [Mag]
intensity [MMI.Int]
focal depth [Focal.Depth..km]
Year [Year]: Remark: month, date, day hour and sec we do not
include, given we do not see an added value here.
Location [Latitude],[Longitude]
Volcano, Tsunami: [Vol],[Tsu]
Location name: We do not include this either, given it would lead
to too many levels, which is problematic using linear regression. We
have the Lattitude and longitude, included which has an indication on
location.
3.3 Data Cleaning and fitting linear model
Total Death numbers, total damage costs, intensity and focal depth
are not available in each row of our data set. Given we cannot predict
these values we have decided to remove the rows, where any of these
values is NA. We are also turning NA`s in Tsu and Vol column to 0 in
order we can fit the linear model below.
Our final data set eqdata.no.na.total.death has 678 rows.
Above we can see the Adjusted R2 value is 0.0526, which means that
our model explains only 6 % the relationship of the variables. This is a
very low result.
The individual variables except 2 of them are larger than 0.005 which
indicates, that they do not seem to have a meaningful effect on our
dependent variable.
In the code below, we will be removing them one by one, and will not
comment those results where the results do not change meaningfully.
As a result we end up keeping the Magnitude and Intensity which are
the 2 independent variables which seem to be relevant for our model.
##
## Call:
## lm(formula = Total.Deaths ~ Mag + MMI.Int, data = eqdata.no.na.total.death)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11671 -4284 -1815 636 233936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20826.0 4245.5 -4.905 1.17e-06 ***
## Mag 1796.1 695.0 2.584 0.00996 **
## MMI.Int 1471.6 359.7 4.092 4.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14720 on 675 degrees of freedom
## Multiple R-squared: 0.05645, Adjusted R-squared: 0.05366
## F-statistic: 20.19 on 2 and 675 DF, p-value: 3.04e-09
Colinearity
Now we are looking at the correlation matrix, to check whether there
is a collinearity between Magnitude and Intensity.
As we can see the correlation coefficients are not too high, so there
is no colinearity between Magnitife and Intensity.
An other indicator to check this is the VIF value: Which is not high
either.
Finally, let us check an interpret the coefficients of the linear
model.
The intercept is below zero, which is very hard to interpret for our
variable. This may not even make sens.
The coefficient for magnitude tells us, that with each unit increase
of a magnitude the number of deaths increases with 1796.
The coefficient of the intensity, tells us, with unchanged value of
the intercept and the magnitude, if the intensity is one unit higher, it
is likely that the number of deaths increases additionnaly with
1471.
3.4 Graphical analysis Magnitude and Intensity
Let`s look at in 2 dimension, how our data plots Magnitude and
Intensity plotted against the number of total deaths looks like:
The blue line ia a regression line. We can see in both cases that our
linear regresion line on both cases does not really fits to the data
points especially at higher values of magnitude and Intensity.
If we plot in 3D our data points, as well as the plaine, which
visualizes our linear model, looking at the data we have the below
assumptions:
There is non-linearity in our model
Data is not normally distributed
There is heteroskedasticity
The data contains outliers and high leverage points
To investigate on our assumptions, we plot the residual against
the fitted values.
Figure below for Magnitude vs. Total Death:
Figure below for Intensity vs. Total Death:
In both figures we can see 4 plots which we explain below:
Residuals vs Fitted Values
The red line is linear, however towards higher values majority of our
data points start to be divide from the line. This is the place where we
also have our outliers.This may indicate some non-lineriality in our
data.
Normal Q-Q
This chart we can see how the distribution of the data is. In case
the most data points lie in the line, the data is normally distributed.
In our case towards higher values the majority of our data points
divergates from the line, this means that in our model the residuals
aren’t Gaussian and thus your errors aren’t either, so our data is not
normally distributed.
Scale Location
This chart we can see if there is heteroskedasticity in our data. On
both charts, the red line is approximately horizontal, this means that
the average magnitude of the standardized residuals isn’t changing much
as a function of the fitted values. However the distribution of the
datapoints concentrates around the value 5000 and in lower and higher
values we have significantly less datapoints. This indicates
heteroskedasticity.
Residuals vs Leverage
In the plotted data we can already see that there are some data
points “far” away, or outlying. This chart shows which outliers are high
leverage points, which means their change or removal influences more our
model as the other datapoints. At Magnitude as well as at Intensity we
have few such a points which lie outside of the line indicating the
Cook`s distance.
To sum up our findings above, we can conclude, that our linear
model fitted with Intensity and Magnitude is not the best fit for
predict the number of deaths caused by earthquakes. Therefoer below we
are going to look at more complex models.
4. Generalised Linear Model Poisson
5. Generalised Linear Model Binomial
6. Generalised Additive Model
Difference GLM vs. GAM GAM do not assume a priori any specific form
of this relationship, and can be used to reveal and estimate non-linear
effects of the covariate on the dependent variable.